Abstract

In this paper we investigate the use of the regularized correntropy framework for
learning classifiers from noisy labels. Class label predictors learned by
minimizing traditional loss functions are sensitive to noisy and outlying
labels of training samples, because the traditional loss functions are applied
equally to all the samples. To solve this problem, we propose to learn the class
label predictors by maximizing the correntropy between the predicted labels
and the true labels of the training samples, under the regularized Maximum
Correntropy Criterion (MCC) framework. Moreover, we regularize the predictor
parameter to control the complexity of the predictor. The learning problem is
formulated as an objective function that considers the parameter regularization
and the MCC simultaneously. By optimizing the objective function alternately, we
develop a novel predictor learning algorithm. Experiments on two challenging
pattern classification tasks show that it significantly outperforms
machines with traditional loss functions.

1. Introduction


Classification machine design has been a basic problem in the pattern
recognition field. It tries to learn an effective predictor to map the feature vector
of a sample to its class label [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. We study the supervised
multi-class learning problem with $L$ classes. Suppose we have a training set
denoted as $\mathcal{D} = \{(x_i, y_i)\}$, $i = 1, \cdots, N$, where
$x_i = [x_{i1}, \cdots, x_{iD}]^\top \in \mathbb{R}^D$ is the $D$-dimensional feature
vector of the $i$-th training sample, and $y_i \in \{1, \cdots, L\}$ is the
class label of the $i$-th training sample. Moreover, we also denote the label indicator
matrix as $Y = [Y_{li}] \in \mathbb{R}^{L \times N}$, with $Y_{li} = 1$ if $y_i = l$, and
$-1$ otherwise. We try to learn $L$ class label predictors $\{f^l_\theta(x)\}$,
$l = 1, \cdots, L$, for the multi-class learning problem, where $f^l_\theta(x)$ is the
predictor for the $l$-th class and $\theta$ is its parameter.
Given a sample $x_i$, the output of the $l$-th predictor is denoted as $f^l_\theta(x_i)$, and
we further denote the prediction result matrix as
$F_\theta = [F_{\theta li}] \in \mathbb{R}^{L \times N}$, with
$F_{\theta li} = f^l_\theta(x_i)$. To make the prediction as precise as possible, the target of
predictor learning is to learn the parameter $\theta$ so that the difference between the true class
labels of the training samples in $Y$ and the prediction results in $F_\theta$ is
minimized, while keeping the complexity of the predictor as low as possible. To
measure how well the prediction results fit the true class label indicator, several
loss functions $L(F_\theta, Y)$ could be considered to compare the prediction results
in $F_\theta$ against the true class labels of the training samples in $Y$, such as the 0-1
loss function, the square loss function, the hinge loss function, and the logistic
loss function. We summarize various loss functions in Table 1.

Table 1: Various empirical loss functions for predictor learning

Title: 0-1 Loss
Formula of $L(F_\theta, Y)$: $\sum_{l,i} I[F_{\theta li} Y_{li} < 0]$, where $I(\cdot)$ is the
indicator function, with $I(\cdot) = 1$ if $(\cdot)$ is true and $0$ otherwise.
Notes: The 0-1 loss function is NP-hard to optimize, non-smooth and non-convex.

Title: Square Loss
Formula of $L(F_\theta, Y)$: $\sum_{l,i} [F_{\theta li} - Y_{li}]^2 = \|1_{L \times N} - F_\theta \circ Y\|^2$,
where $\circ$ denotes the element-wise product of two matrices, and $1_{L \times N}$ is an
$L \times N$ matrix with all elements equal to one. The two forms agree because $Y_{li}^2 = 1$.
Notes: The square loss function is a convex upper bound on the 0-1 loss. It is smooth
and convex, thus easy to optimize.

Title: Hinge Loss
Formula of $L(F_\theta, Y)$: $\sum_{l,i} [1 - F_{\theta li} Y_{li}]_+ = 1_L^\top [1_{L \times N} - F_\theta \circ Y]_+ 1_N$,
where $[x]_+ = \max(0, x)$, and $1_N \in \mathbb{R}^N$ is a column vector with all ones.
Notes: The hinge loss function is not smooth, but subgradient descent can be used to
optimize it. It is the most common loss function in SVM.

Title: Logistic Loss
Formula of $L(F_\theta, Y)$: $\sum_{l,i} \ln[1 + e^{-F_{\theta li} Y_{li}}] = 1_L^\top \ln[1_{L \times N} + e^{-F_\theta \circ Y}] 1_N$
Notes: This loss function is also smooth and convex, and is usually used in regression
problems.

The loss functions in Table 1 have been used widely in various
learning problems. One common feature of these loss functions is that a sample-wise
loss function is applied to each training sample equally, and the losses
of all the samples are then summed up to obtain the final overall loss. The
sample-wise loss functions are of exactly the same form, with the same parameters (if
they have any). The basic assumption behind these loss functions is that
all training samples are of the same importance. However, due to the limitations
of the sampling technology and the noise that occurs during the sampling procedure,
there are some noisy and outlying samples in real-world applications. If we use
the traditional loss functions listed in Table 1, the noisy and outlying training
samples can play an even more important role than the good samples. Thus the
predictors learned by minimizing the traditional loss functions are not robust
to the noisy and outlying training samples, and could bring a high error rate
when applied to the prediction of test samples.
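
As a minimal sketch of the losses summarized in Table 1, the following Python snippet evaluates each loss on a prediction matrix and a label indicator matrix stored as nested lists. The function names and the toy numbers are illustrative assumptions, not part of the original work. A single badly mispredicted entry dominates the square loss, which is the sensitivity discussed above.

```python
import math

def label_indicator(labels, num_classes):
    """Build the L x N indicator matrix: Y[l][i] = 1 if sample i is in class l+1, else -1."""
    return [[1.0 if y == l + 1 else -1.0 for y in labels]
            for l in range(num_classes)]

def _pairs(F, Y):
    """Iterate over matching entries (F_li, Y_li) of two L x N matrices."""
    for f_row, y_row in zip(F, Y):
        yield from zip(f_row, y_row)

def zero_one_loss(F, Y):
    return sum(1.0 for f, y in _pairs(F, Y) if f * y < 0)

def square_loss(F, Y):
    return sum((f - y) ** 2 for f, y in _pairs(F, Y))

def hinge_loss(F, Y):
    return sum(max(0.0, 1.0 - f * y) for f, y in _pairs(F, Y))

def logistic_loss(F, Y):
    return sum(math.log(1.0 + math.exp(-f * y)) for f, y in _pairs(F, Y))

# Toy example (made-up numbers): 2 classes, 3 samples, the last label is noisy.
Y = label_indicator([1, 2, 1], num_classes=2)
F = [[0.9, -0.8, -3.0],   # class-1 scores; the third disagrees strongly with Y
     [-0.9, 0.8, 3.0]]    # class-2 scores
for loss in (zero_one_loss, square_loss, hinge_loss, logistic_loss):
    print(loss.__name__, round(loss(F, Y), 4))
```

Every sample enters each sum with the same weight, so nothing in these losses can discount the third, noisy sample.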

Recently, the regularized correntropy framework has been proposed for robust
pattern recognition problems [11, 12, 13, 14]. In [15], He et al. argued that the
classical mean square error (MSE) criterion is sensitive to outliers, and
introduced the correntropy to improve the robustness of the representation. Moreover,
the $l_1$ regularization scheme is imposed on the correntropy to learn robust and
sparse representations. Inspired by their work, we propose to use the regularized
correntropy as a criterion to compare the prediction results and the true class
labels. We use correntropy to compare the predicted labels and the true labels,
instead of comparing the feature of a test sample and its reconstruction from the
training samples as in He et al.'s work. Moreover, an $l_2$ norm regularization is
introduced to control the complexity of the predictor. In this way, the predictor
learned by maximizing the correntropy between the prediction results and the true
labels will be robust to the noisy and outlying training samples. The proposed
classification Machine Maximizing the Regularized CorrEntropy, which is called
RegMaxCEM, is expected to be less sensitive to outlying samples than the
ones with traditional loss functions. Yang et al. [16] also proposed to use
correntropy to compare predicted class labels and true labels. However, in their
framework, the target is to learn the class labels of the unlabeled samples in a
transductive semi-supervised manner, while we try to learn the parameters of
the class label predictor in a supervised manner.

The rest of this paper is structured as follows: In Section 2, we propose the
regularized maximum correntropy machine by constructing an objective function
based on the maximum correntropy criterion (MCC) and developing an
expectation-maximization (EM) based alternating algorithm for its optimization.
In Section 3, the proposed methods are validated by conducting extensive
experiments on two challenging pattern classification tasks. Finally, we give the
conclusion in Section 4.

2. Regularized Maximum Correntropy Machine 

In this section we introduce the classification machine that maximizes the
correntropy between the predicted class labels and the true class labels, while
keeping the solution as simple as possible.

2.1. Objective Function

To design the predictors $f^l_\theta(x)$, we first represent the data sample $x$ as
$\tilde{x}$ in the linear space or the kernel space as:

$$\tilde{x} = \begin{cases} x, & \text{(linear)}, \\ K(\cdot, x), & \text{(kernel)}, \end{cases} \tag{1}$$

where $K(\cdot, x) = [K(x_1, x), \cdots, K(x_N, x)]^\top \in \mathbb{R}^N$ and $K(x_i, x_j)$ is a kernel
function between $x_i$ and $x_j$. Then a linear predictor $f^l_\theta(x)$ is designed to
predict whether the sample belongs to the $l$-th class as

$$f^l_\theta(x) = w_l^\top \tilde{x} + b_l, \quad l = 1, \cdots, L, \tag{2}$$

where $\theta = \{(w_l, b_l)\}_{l=1}^L$ are the parameters of the predictors, $w_l \in \mathbb{R}^D$ is the linear
coefficient vector and $b_l \in \mathbb{R}$ is a bias term for the $l$-th predictor. The target
of predictor design is to find the optimal parameters so that the prediction
result $f^l_\theta(x_i)$ of the $i$-th sample fits its true class label indicator $Y_{li}$ as well
as possible, while keeping the solution as simple as possible. To this end, we
consider the following two problems simultaneously when designing the objective
function:

Prediction Accuracy Criterion based on Correntropy To consider the prediction
accuracy, we could learn the predictor parameters by minimizing
a loss function listed in Table 1 as

$$\min_\theta L(F_\theta, Y). \tag{3}$$

However, as we mentioned in Section 1, all these loss functions are applied
to all the training samples equally, which is not robust to noisy
and outlying samples. To handle this problem, instead of minimizing a
loss function to learn the predictor, we use the MCC [11] framework to
learn the predictor by maximizing the correntropy between the predicted
results and the true labels.

Remark 1: In previous studies, it has been claimed that the MCC is
insensitive to outliers. For example, in [11], it is claimed that “the maximum
correntropy criterion, ... is much more insensitive to outliers.” Based on
this claim, we assume that the predictors developed based on MCC should
also be insensitive to outliers.

Correntropy is a generalized similarity measure between two arbitrary
random variables $A$ and $B$. However, the joint probability density function
of $A$ and $B$ is usually unknown, and only a finite number of samples
of them are available as $\{(a_i, b_i)\}_{i=1}^d$. This leads to the following sample
estimator of correntropy:



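$$V(A, B) = \frac{1}{d} \sum_{i=1}^d g_\sigma(a_i - b_i), \tag{4}$$

where $g_\sigma(a_i - b_i) = \exp\left(-\frac{\|a_i - b_i\|^2}{2\sigma^2}\right)$ is a Gaussian kernel function, and $\sigma$
is a kernel width parameter. For a learning system, MCC is defined as

$$\max_\vartheta \frac{1}{d} \sum_{i=1}^d g_\sigma(a_i - b_i), \tag{5}$$

where $\vartheta$ is the parameter to be optimized in the criterion so that $B$ is as
correlated to $A$ as possible.

Remark 2: $\vartheta$ is usually a parameter that defines $B$, not the kernel
function parameter $\sigma$. In the learning system, we try to learn $\vartheta$ so that
with the learned $\vartheta$, $B$ is correlated to $A$. For example, in this case, $A$ is
the true class label matrix while $B$ is the predicted class label matrix, and
$\vartheta$ is the predictor parameter that defines $B$.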
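
To illustrate why the sample estimator of correntropy is less affected by outliers than a squared error, the following sketch compares the two on made-up data. It assumes the Gaussian kernel $\exp(-e^2 / (2\sigma^2))$ with $\sigma = 1$; the function names are illustrative.

```python
import math

def gaussian_kernel(e, sigma):
    """g_sigma(e) = exp(-e^2 / (2 sigma^2)) for a scalar error e."""
    return math.exp(-(e * e) / (2.0 * sigma * sigma))

def correntropy(a, b, sigma=1.0):
    """Sample estimator: mean of the Gaussian kernel over paired samples."""
    return sum(gaussian_kernel(x - y, sigma) for x, y in zip(a, b)) / len(a)

def mse(a, b):
    """Mean square error over paired samples, for comparison."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Made-up data: b_clean tracks a closely; b_outlier has one grossly wrong entry.
a = [1.0, -1.0, 1.0, -1.0]
b_clean = [0.9, -1.1, 1.0, -0.8]
b_outlier = [0.9, -1.1, 1.0, 10.0]

print("correntropy:", correntropy(a, b_clean), correntropy(a, b_outlier))
print("MSE:        ", mse(a, b_clean), mse(a, b_outlier))
```

Each pair contributes at most $1/d$ to the estimator, so one outlier can lower the correntropy by no more than $1/d$, whereas its contribution to the MSE grows quadratically with the error.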

To adapt the MCC framework to the predictor learning problem, we let $A$
be the true class label matrix $Y$, and $B$ be the prediction result matrix
$F_\theta$ parameterized by $\theta$, and we want to find the predictor parameter $\theta$ such
that $F_\theta$ becomes as correlated to $Y$ as possible under the MCC framework.
Then, the following correntropy-based predictor learning model is
obtained:

$$\max_\theta V(F_\theta, Y), \quad V(F_\theta, Y) = \frac{1}{L \times N} \sum_{l=1}^L \sum_{i=1}^N g_\sigma(F_{\theta li} - Y_{li}). \tag{6}$$

Please notice that in [11], MCC is used to measure the similarity between a
test sample and its sparse linear representation of training samples, while
in this work it is used to measure the similarity between the predicted class
label and its true label. Also note that the dependence on $\sigma$ in (6) and
later in (8) and (10) comes from the kernel function $g_\sigma(\cdot)$. In our
experiments, the $\sigma$ value is calculated as
$\sigma = \frac{1}{2 \times L \times N} \sum_{l=1}^L \sum_{i=1}^N \|F_{\theta li} - Y_{li}\|^2$, following [3].


Predictor Regularization To control the complexity of the $l$-th predictor
independently, we introduce the $l_2$-based regularizer $\|w_l\|^2$ on the coefficient
vector $w_l$ of the $l$-th predictor. We assume that the predictors of different
classes are equally important, and the following regularizer is introduced
for the multi-class learning problem:

$$\min_{\{w_l\}_{l=1}^L} \frac{1}{L} \sum_{l=1}^L \|w_l\|^2. \tag{7}$$

Remark 3: The $l_2$ norm is also used by support vector regression as a
measure of model complexity. However, in support vector classification,
this regularization term is obtained either by a “maximal margin” regularization
or by a “maximal robustness” regularization for a certain
type of feature noise [17]. Thus our $l_2$ norm regularization term can also
be regarded as a term to seek maximal margin or robustness.

Remark 4: We use $l_2$-regularization rather than $l_1$-regularization
in our model. Using $l_1$-regularization we can seek sparsity of the predictor
coefficient vector, but it cannot guarantee the minimal model complexity,
maximal margin or maximal robustness like the $l_2$-regularization;
thus we choose to use the $l_2$-regularization. In the future, we will
explore the usage of $l_1$-regularization to see if the prediction results can be
improved.

By substituting $\theta = \{(w_l, b_l)\}_{l=1}^L$ and $F_{\theta li} = f^l_{w_l, b_l}(x_i)$, and combining both the
predictor regularization term in (7) and the prediction accuracy criterion term
based on correntropy in (6), we obtain the following maximization problem for
the maximum correntropy machine:

$$\max_{\{(w_l, b_l)\}_{l=1}^L} \frac{1}{L \times N} \sum_{l=1}^L \sum_{i=1}^N g_\sigma\left(f^l_{w_l, b_l}(x_i) - Y_{li}\right) - \alpha \frac{1}{L} \sum_{l=1}^L \|w_l\|^2, \tag{8}$$

where $\alpha$ is a tradeoff parameter. This optimization problem is based on
correntropy using a Gaussian kernel function $g_\sigma(x)$. It treats the predictions of
individual training samples for individual classes differently. In this way, we
can give more emphasis to samples with correctly predicted class labels, while
noisy or outlying training samples will make small contributions to the
correntropy. In fact, when the regularizer term is introduced, (8) is a case of
the regularized correntropy framework [l^.
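
The pieces of this subsection (the sample representation, the linear predictor, and the regularized correntropy objective) can be put together in a short sketch. The kernel type is not specified in the text, so the RBF kernel below is an assumption, as are the function names and the Gaussian form $\exp(-e^2 / (2\sigma^2))$.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Assumed kernel choice: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def represent(x, train_X=None, gamma=1.0):
    """Return x itself (linear) or the vector K(., x) over the training set (kernel)."""
    if train_X is None:
        return list(x)
    return [rbf_kernel(xi, x, gamma) for xi in train_X]

def predict(w, b, x_rep):
    """Linear predictor w^T x + b applied to the represented sample."""
    return sum(wj * xj for wj, xj in zip(w, x_rep)) + b

def objective(W, B, X_rep, Y, sigma, alpha):
    """Mean Gaussian-kernel similarity minus alpha times the mean of ||w_l||^2."""
    L, N = len(W), len(X_rep)
    fit = 0.0
    for l in range(L):
        for i in range(N):
            err = predict(W[l], B[l], X_rep[i]) - Y[l][i]
            fit += math.exp(-(err * err) / (2.0 * sigma * sigma))
    reg = sum(sum(wj * wj for wj in w) for w in W) / L
    return fit / (L * N) - alpha * reg

# Made-up example: one class, two 1-D samples, both predicted exactly.
X_rep = [represent([0.0]), represent([0.0])]
print(objective([[1.0]], [1.0], X_rep, [[1.0, 1.0]], sigma=1.0, alpha=0.1))
```

With perfect predictions the correntropy term equals one, so the value printed is one minus the regularization penalty.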


2.2. Optimization 

Due to the nonlinearity of the kernel function $g_\sigma(x)$ in the objective
function in (8), direct optimization is difficult. A property of the kernel
function $g_\sigma(x)$ is that its derivative again contains the same kernel function, so if we set
the derivative to zero to seek the optimum of the objective, it is not easy to
obtain a closed-form solution. However, according to the property of the convex
conjugate function, we have:

Proposition 1 There exists a convex conjugate function $\varphi$ of $g_\sigma(x)$ such that

$$g_\sigma(x) = \max_p \left(p \|x\|^2 - \varphi(p)\right), \tag{9}$$

and for a fixed $x$, the maximum is reached at $p = -g_\sigma(x)$. This proposition
is taken from [18], where it is derived from the theory of convex
conjugate functions. It is further discussed and used in many applications.


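By substituting (9) into (8), we have the augmented optimization problem in
an enlarged parameter space

$$\begin{aligned}
\max_{\{(w_l, b_l)\}_{l=1}^L,\, P} \ & \frac{1}{L \times N} \sum_{l=1}^L \sum_{i=1}^N \left[P_{li} \|f^l_{w_l, b_l}(x_i) - Y_{li}\|^2 - \varphi(P_{li})\right] - \alpha \frac{1}{L} \sum_{l=1}^L \|w_l\|^2 \\
= \ & \frac{1}{L \times N} \sum_{l=1}^L \sum_{i=1}^N \left[P_{li} \|w_l^\top \tilde{x}_i + b_l - Y_{li}\|^2 - \varphi(P_{li})\right] - \alpha \frac{1}{L} \sum_{l=1}^L \|w_l\|^2,
\end{aligned} \tag{10}$$

where $P = [P_{li}] \in \mathbb{R}^{L \times N}$ is the auxiliary variable matrix. To optimize (10),
we adapt the EM framework to solve for $P$ and $\{(w_l, b_l)\}_{l=1}^L$ alternately.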


2.2.1. Expectation Step 

In the expectation step of the EM algorithm, we calculate the auxiliary
variable matrix $P$ while fixing $\theta$. According to Proposition 1, the
maximum of (10) with respect to $P$ is reached at

$$P = -g_\sigma(F_\theta - Y), \quad \text{i.e.,} \quad P_{li} = -g_\sigma(w_l^\top \tilde{x}_i + b_l - Y_{li}). \tag{11}$$

Note that $g_\sigma(X)$ is here the element-wise Gaussian function. With fixed predictor
parameters, the auxiliary variable $-P_{li}$ can be regarded as the confidence of the
prediction result of the $i$-th training sample with regard to the $l$-th class. The better
the $l$-th prediction result of the $i$-th sample fits the true label $Y_{li}$, the larger
$-P_{li}$ will be.
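
A minimal sketch of this step, assuming the Gaussian form $\exp(-e^2 / (2\sigma^2))$ with $\sigma = 1$ and made-up prediction scores, shows how a sample whose prediction disagrees with its (possibly noisy) label receives a small confidence:

```python
import math

def e_step(F, Y, sigma=1.0):
    """Element-wise update P_li = -exp(-(F_li - Y_li)^2 / (2 sigma^2))."""
    return [[-math.exp(-((f - y) ** 2) / (2.0 * sigma ** 2))
             for f, y in zip(f_row, y_row)]
            for f_row, y_row in zip(F, Y)]

# Made-up scores for one class and three samples; the third label is noisy.
F = [[0.9, -0.8, 1.1]]
Y = [[1.0, -1.0, -1.0]]
P = e_step(F, Y)
confidence = [[-p for p in row] for row in P]
print([round(c, 3) for c in confidence[0]])
```

The first two samples keep a confidence close to one, while the third is strongly discounted in the subsequent parameter update.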

Remark 5: It is interesting to see if there is any relation between the
auxiliary variables in $P$ and the slack variables in SVM. Actually, both the auxiliary
variables in $P$ and the slack variables in SVM can be viewed as measures of
classification losses. The slack variables in SVM are upper bounds of the
hinge losses of the training samples, while the auxiliary variables in $P$ are a
dissimilarity measure between the predicted labels and the true labels under the
framework of the MCC rule, which is also a loss function. Meanwhile, the
auxiliary variables in $P$ also play the role of weights of different training samples as
in (13), so that the learning can be robust to the noisy labels, whereas the slack
variables in SVM do not have such a function.

Remark 6: In the expectation step, we actually solve an alternating
optimization subproblem: solving for $P$ while fixing $\{(w_l, b_l)\}_{l=1}^L$. However, according to
Proposition 1, the solution of this optimization problem has the form of (11), which
can be calculated directly; this makes it the expectation step of the EM algorithm.

2.2.2. Maximization Step 

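In the maximization step of the EM algorithm, we solve for the predictor
parameters $\{(w_l, b_l)\}_{l=1}^L$ while fixing $P$. The optimization problem in (10) turns
to

$$\max_{\{(w_l, b_l)\}_{l=1}^L} \frac{1}{L \times N} \sum_{l=1}^L \sum_{i=1}^N \left[P_{li} \|w_l^\top \tilde{x}_i + b_l - Y_{li}\|^2 - \varphi(P_{li})\right] - \alpha \frac{1}{L} \sum_{l=1}^L \|w_l\|^2. \tag{12}$$

Noticing that $P_{li} < 0$ and removing the terms irrelevant to $w_l$ and $b_l$, the maximization
problem in (12) can be reformulated as the following equivalent minimization problem:

$$\begin{aligned}
& \min_{\{(w_l, b_l)\}_{l=1}^L} O(w_1, b_1, \cdots, w_L, b_L), \\
& O(w_1, b_1, \cdots, w_L, b_L) = \frac{1}{L \times N} \sum_{l=1}^L \sum_{i=1}^N \left(-P_{li} \|w_l^\top \tilde{x}_i + b_l - Y_{li}\|^2\right) + \alpha \frac{1}{L} \sum_{l=1}^L \|w_l\|^2.
\end{aligned} \tag{13}$$

To simplify the notation, we define a vector $u_l = [u_{l1}, \cdots, u_{lN}]^\top \in \mathbb{R}^N$ so that
$u_{li} = \sqrt{-\frac{1}{N} P_{li}}$. With $u_l$, the objective function in (13) can be rewritten as

$$\begin{aligned}
O(w_1, b_1, \cdots, w_L, b_L) &= \frac{1}{L} \sum_{l=1}^L \left[\sum_{i=1}^N \|u_{li} (w_l^\top \tilde{x}_i + b_l - Y_{li})\|^2 + \alpha \|w_l\|^2\right] \\
&= \frac{1}{L} \sum_{l=1}^L \left[(w_l^\top \bar{X}_l + b_l u_l^\top - \bar{Y}_l)(w_l^\top \bar{X}_l + b_l u_l^\top - \bar{Y}_l)^\top + \alpha w_l^\top w_l\right],
\end{aligned} \tag{14}$$

where $\bar{X}_l = [u_{l1} \tilde{x}_1, \cdots, u_{lN} \tilde{x}_N] \in \mathbb{R}^{D \times N}$ is the matrix containing all the training
sample feature vectors weighted by $u_l$, and $\bar{Y}_l = [u_{l1} Y_{l1}, \cdots, u_{lN} Y_{lN}] \in \mathbb{R}^{1 \times N}$ is
the $l$-th row of $Y$ weighted by $u_l$.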

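Obviously, the optimization problem in (13) is a linear least squares problem,
and an analytical solution can be obtained easily. By setting the
derivative of $O(w_1, b_1, \cdots, w_L, b_L)$ with regard to $b_l$ to zero, we have

$$\frac{\partial O(w_1, b_1, \cdots, w_L, b_L)}{\partial b_l} = \frac{2}{L} (w_l^\top \bar{X}_l + b_l u_l^\top - \bar{Y}_l) u_l = 0
\ \Rightarrow \ b_l = \frac{(\bar{Y}_l - w_l^\top \bar{X}_l) u_l}{u_l^\top u_l} = \bar{y}_l - w_l^\top \bar{x}_l, \tag{15}$$

where $\bar{y}_l = \frac{\bar{Y}_l u_l}{u_l^\top u_l}$ and $\bar{x}_l = \frac{\bar{X}_l u_l}{u_l^\top u_l}$. By substituting (15) into
$O(w_1, b_1, \cdots, w_L, b_L)$, we have

$$O(w_1, \cdots, w_L) = \frac{1}{L} \sum_{l=1}^L \left\{[w_l^\top (\bar{X}_l - \bar{x}_l u_l^\top) - (\bar{Y}_l - \bar{y}_l u_l^\top)][w_l^\top (\bar{X}_l - \bar{x}_l u_l^\top) - (\bar{Y}_l - \bar{y}_l u_l^\top)]^\top + \alpha w_l^\top w_l\right\}. \tag{16}$$

By setting the derivative of $O(w_1, \cdots, w_L)$ with regard to $w_l$ to zero, we have
the optimal solution $w_l^*$:

$$\begin{aligned}
& \frac{\partial O(w_1, \cdots, w_L)}{\partial w_l} = \frac{2}{L} \left[(\bar{X}_l - \bar{x}_l u_l^\top)(\bar{X}_l - \bar{x}_l u_l^\top)^\top w_l - (\bar{X}_l - \bar{x}_l u_l^\top)(\bar{Y}_l - \bar{y}_l u_l^\top)^\top + \alpha w_l\right] = 0 \\
& \Rightarrow \ w_l^* = \left[(\bar{X}_l - \bar{x}_l u_l^\top)(\bar{X}_l - \bar{x}_l u_l^\top)^\top + \alpha I\right]^{-1} (\bar{X}_l - \bar{x}_l u_l^\top)(\bar{Y}_l - \bar{y}_l u_l^\top)^\top,
\end{aligned} \tag{17}$$

where $I$ is a $D \times D$ identity matrix. Then we substitute $w_l^*$ into (15), and we
have the optimal solution $b_l^*$,

$$b_l^* = \bar{y}_l - w_l^{*\top} \bar{x}_l. \tag{18}$$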
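
The weighted least squares problem in (13) can be solved class by class. The sketch below is one way to do so under stated assumptions: it treats the problem for a single class as a weighted ridge regression with sample weights $q_i = -P_{li}/N$, centers features and labels by their weighted means, solves the resulting linear system with a small Gaussian elimination routine, and then checks that the gradient of the per-class objective vanishes at the solution. The helper names and the toy data are illustrative.

```python
def solve(A, c):
    """Solve A w = c by Gaussian elimination with partial pivoting."""
    n = len(c)
    M = [row[:] + [ci] for row, ci in zip(A, c)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    w = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][j] * w[j] for j in range(k + 1, n))
        w[k] = (M[k][n] - s) / M[k][k]
    return w

def m_step_one_class(X, y_row, p_row, alpha):
    """Weighted ridge fit for one class with sample weights q_i = -P_li / N."""
    n, d = len(X), len(X[0])
    q = [-p / n for p in p_row]
    q_sum = sum(q)
    x_bar = [sum(q[i] * X[i][j] for i in range(n)) / q_sum for j in range(d)]
    y_bar = sum(q[i] * y_row[i] for i in range(n)) / q_sum
    A = [[sum(q[i] * (X[i][j] - x_bar[j]) * (X[i][k] - x_bar[k]) for i in range(n))
          + (alpha if j == k else 0.0) for k in range(d)] for j in range(d)]
    c = [sum(q[i] * (X[i][j] - x_bar[j]) * (y_row[i] - y_bar) for i in range(n))
         for j in range(d)]
    w = solve(A, c)
    bias = y_bar - sum(wj * xj for wj, xj in zip(w, x_bar))
    return w, bias

def gradient(X, y_row, p_row, alpha, w, bias):
    """Gradient of sum_i q_i (w^T x_i + b - y_i)^2 + alpha ||w||^2 w.r.t. (w, b)."""
    n, d = len(X), len(X[0])
    q = [-p / n for p in p_row]
    r = [sum(wj * xj for wj, xj in zip(w, X[i])) + bias - y_row[i] for i in range(n)]
    g_w = [2.0 * (sum(q[i] * r[i] * X[i][j] for i in range(n)) + alpha * w[j])
           for j in range(d)]
    g_b = 2.0 * sum(q[i] * r[i] for i in range(n))
    return g_w, g_b

# Made-up data: 4 samples, 2 features, one down-weighted sample (P close to 0).
X = [[0.0, 1.0], [1.0, 0.5], [2.0, -1.0], [3.0, 2.0]]
y_row = [1.0, 1.0, -1.0, 1.0]
p_row = [-1.0, -0.9, -0.8, -0.05]
w, b = m_step_one_class(X, y_row, p_row, alpha=0.1)
print(w, b, gradient(X, y_row, p_row, 0.1, w, b))
```

Because the system matrix is a weighted covariance plus $\alpha I$, it is positive definite for $\alpha > 0$, so the solve step is well posed even when some weights are very small.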


2.3. Algorithm 

Algorithm 1 summarizes the predictor parameter learning procedure of
RegMaxCEM. The M-step and the E-step are repeated $T$ times.

Algorithm 1 RegMaxCEM Learning Algorithm.

Input: Training set: $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$;
Initialize the auxiliary variable matrix $P^0 = -1_{L \times N}$;
Represent each sample $x_i$ as $\tilde{x}_i$ as in (1);
for $t = 1, \cdots, T$ do
    Maximization-Step: Update the predictor parameters $\theta^t = \{(w_l^t, b_l^t)\}_{l=1}^L$
    as in (17) and (18) by fixing $P^{t-1}$.
    Expectation-Step: Update the auxiliary variable matrix $P^t$ as in (11)
    by fixing the predictor parameters $\theta^t$.
end for
Output: Predictor parameters $\theta^T = \{(w_l^T, b_l^T)\}_{l=1}^L$.
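
The alternation in Algorithm 1 can be sketched end to end for the simplest possible setting. The code below is an illustration under several assumptions that are not taken from the paper: scalar features (so the M-step reduces to a scalar weighted ridge fit), a fixed kernel width $\sigma = 1$, the Gaussian form $\exp(-e^2 / (2\sigma^2))$, made-up data with one flipped label, and a decision rule that picks the class with the largest predictor output.

```python
import math

def fit_regmaxcem_1d(x, labels, num_classes, alpha=0.01, sigma=1.0, T=10):
    """Alternate the M-step (weighted ridge fit) and E-step (Gaussian weights).

    Sketch for scalar features only, with a fixed kernel width sigma.
    Returns per-class (w, b) pairs and the final auxiliary matrix P.
    """
    n = len(x)
    Y = [[1.0 if y == l + 1 else -1.0 for y in labels] for l in range(num_classes)]
    P = [[-1.0] * n for _ in range(num_classes)]          # P^0 = -1
    params = []
    for _ in range(T):
        params = []
        for l in range(num_classes):                      # M-step, class by class
            q = [-p / n for p in P[l]]
            q_sum = sum(q)
            x_bar = sum(qi * xi for qi, xi in zip(q, x)) / q_sum
            y_bar = sum(qi * yi for qi, yi in zip(q, Y[l])) / q_sum
            s_xx = sum(qi * (xi - x_bar) ** 2 for qi, xi in zip(q, x))
            s_xy = sum(qi * (xi - x_bar) * (yi - y_bar)
                       for qi, xi, yi in zip(q, x, Y[l]))
            w = s_xy / (s_xx + alpha)
            params.append((w, y_bar - w * x_bar))
        # E-step: refresh the auxiliary variables from the current residuals.
        P = [[-math.exp(-((w * xi + b - yi) ** 2) / (2.0 * sigma ** 2))
              for xi, yi in zip(x, Y[l])]
             for l, (w, b) in enumerate(params)]
    return params, P

def predict_class(params, xi):
    """Assign the class whose predictor gives the largest output (assumed rule)."""
    scores = [w * xi + b for w, b in params]
    return scores.index(max(scores)) + 1

# Made-up data: class 1 on the left, class 2 on the right; the last label is flipped.
x = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 1.8]
labels = [1, 1, 1, 2, 2, 2, 1]
params, P = fit_regmaxcem_1d(x, labels, num_classes=2)
print([round(-p, 3) for p in P[0]])
print(predict_class(params, 1.8), predict_class(params, -1.8))
```

On this toy problem the sample with the flipped label ends up with the smallest confidence $-P_{li}$, so it has little influence on the fitted predictors.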


3. Experiments 

In the experiments, we evaluate the proposed classification method on
two challenging pattern classification tasks: bacteria identification [21] and
prediction of DNA-binding sites in proteins [22].


3.1. Experiment I: Bacteria Identification 

3.1.1. Dataset and Setup 

High-precision identification of bacteria is quite important for the
diagnosis of cancers and bacterial infections. Recently, ensemble aptamers
(ENSaptamers), which utilize a small set of nonspecific DNA sequences, have been
proposed to provide an effective platform for the detection of bacteria [21].
ENSaptamers is a sensor array with seven sensors, and each sensor is designed using
a DNA element.

For the experiment, we collected in total 66 samples of 6 different bacteria,
including S.typhimurium, S.flexneri, E.coli (CAU 0111), S.sonnei, S.typhi and
E.coli (ATCC 25922). The number of samples for each bacterium varies from 9 to
13. Given an unknown bacteria sample with its fluorescence response patterns
of ENSaptamer, the task is to identify which bacterium it is. To this end, the
seven fluorescence response patterns of ENSaptamer against the sample are
used to construct the 7-dimensional feature vector, and then the sample is
classified into one of the known bacteria using the RegMaxCEM predictor.

To conduct the experiment, we randomly split the entire dataset into two
non-overlapping subsets: the training set and the test set. 33 samples were
used as training samples, while the remaining 33 were used as
test samples. The predictor parameters of RegMaxCEM were trained using the
feature vectors and class labels of the training samples. Then the class labels of
the test samples were predicted by the trained predictor, and compared to their
true labels to calculate the classification accuracy. The random split process
(training/test) was repeated ten times and the accuracies over these ten
splits were reported as the classification performance.
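
The evaluation protocol above can be sketched as follows. This is only an outline under assumptions: the split is a plain random permutation (the text does not say whether it is stratified by class), the seeds are arbitrary, and `dummy_scorer` is a hypothetical placeholder for training a predictor and measuring its test accuracy.

```python
import random

def random_split(num_samples, num_train, seed):
    """Return disjoint train/test index lists from a random permutation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:num_train]), sorted(indices[num_train:])

def evaluate_over_splits(num_samples, num_train, num_repeats, fit_and_score):
    """Repeat the split and collect one accuracy per repetition."""
    return [fit_and_score(*random_split(num_samples, num_train, seed))
            for seed in range(num_repeats)]

# Placeholder scorer (hypothetical): a real one would train on train_idx
# and return the accuracy measured on test_idx.
def dummy_scorer(train_idx, test_idx):
    return len(test_idx) / (len(train_idx) + len(test_idx))

accuracies = evaluate_over_splits(66, 33, 10, dummy_scorer)
print(accuracies)
```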

3.1.2. Results 

We compare our proposed method against classifiers based on other loss
functions, including the square loss, hinge loss and logistic loss. The 0-1 loss is the simplest
loss function, but it is difficult to optimize, and thus is not compared in the experiment.
The boxplots of the accuracies of different methods using both linear and kernel
representations are illustrated in Figure 1. As shown in Figure 1, the predictor
produced by maximizing the correntropy yields improvements over the other loss
functions. Given the extremely small variation of the classification accuracies over
the ten splits, though the improvement in accuracy is not large in absolute
terms (around 0.1), it is consistent and significant. To verify whether the
improvements are statistically significant, we performed paired t-tests on
the accuracies of the proposed method and the other compared methods. The null
hypothesis of the t-test is that the accuracies of the proposed method and the
compared method come from distributions with equal means. The P values
of the t-tests are reported as measurements of statistical significance. A low
P value implies that the difference between the proposed method and the
compared method is statistically significant. The P values are reported in Table 2.
As we can see from the table, all the improvements achieved by RegMaxCEM,
for both the linear representation and the kernel representation, are statistically
significant at the 0.05 significance level. This is not surprising: there are some
noisy and outlying samples in the training set, which are used by the
methods with square loss, hinge loss or logistic loss on an equal footing with the other
samples, and thus they bring some bias to the predictor. However, RegMaxCEM
has the potential of filtering these samples, which can result in reliable learning
of predictors in practice. It is also interesting to notice that the square loss,
hinge loss and logistic loss have achieved very similar classification accuracies.
Though they use different loss functions, these loss functions are applied to
the training samples equally.
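
For readers unfamiliar with the paired t-test, the statistic can be computed in a few lines. The accuracies below are hypothetical values chosen purely for illustration; they are not the results of this paper. With ten paired splits there are 9 degrees of freedom, for which the two-sided critical value at the 0.05 level is 2.262.

```python
import math

def paired_t_statistic(a, b):
    """t = mean(d) / (std(d) / sqrt(n)) for paired differences d = a - b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)   # unbiased sample variance
    return mean_d / math.sqrt(var_d / n)

# Hypothetical accuracies over ten splits (illustrative only, not the paper's data).
proposed = [0.94, 0.91, 0.97, 0.94, 0.91, 0.94, 0.97, 0.94, 0.91, 0.94]
baseline = [0.85, 0.82, 0.85, 0.88, 0.82, 0.85, 0.85, 0.82, 0.85, 0.85]
t = paired_t_statistic(proposed, baseline)
T_CRIT = 2.262   # two-sided 0.05 critical value of Student's t with 9 degrees of freedom
print(round(t, 3), abs(t) > T_CRIT)
```

The test operates on the per-split differences, so it accounts for the fact that both methods are evaluated on the same ten splits.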

Figure 1: Boxplots of accuracies of bacteria identification for RegMaxCEM, Square Loss,
Hinge Loss and Logistic Loss. (a) Linear representation. (b) Kernel representation.

Table 2: P values of paired t-tests on accuracies of ten splits of RegMaxCEM and compared
methods on bacteria identification.

Compared method    P value (linear representation)    P value (kernel representation)
Square Loss        0.0266                             0.0118
Hinge Loss         0.0243                             0.0224
Logistic Loss      0.0115                             0.0095


3.2. Experiment II: DNA-Binding Site Prediction 

It is very important to predict the DNA-binding sites in proteins for
understanding the molecular mechanisms of protein-DNA interaction. In this
experiment, we evaluate the proposed method for the prediction of DNA-binding sites [22].


3.2.1. Dataset and Setup 

The PDNA-62 database for DNA-binding site prediction was used in
this experiment. This database contains 8,163 sites in proteins in total. Among
these sites, 1,215 are DNA-binding sites, while the remaining 6,948
are non-binding sites. We selected 1,000 DNA-binding sites and 5,000
non-binding sites from the PDNA-62 database to construct our database for the
experiment. Given a candidate site, the goal of DNA-binding site prediction is
to predict whether it is a DNA-binding site or not. To this end, the evolutionary
information, solvent accessible surface area and the protein backbone structure
features were extracted from the site, and then combined to construct the feature
vector. The feature vector was then fed into the classifier to distinguish
DNA-binding sites from non-binding sites [22].

To conduct the experiment, we employed 10-fold cross validation. The
database was split randomly into 10 non-overlapping folds, one of which was
used as the test set, while the remaining 9 were used as the training set. The
procedure was repeated 10 times so that each fold was used as the test set
once.

















The prediction performance was measured by the receiver operating
characteristic (ROC) and recall-precision curves. The ROC curve is used mainly
because of the imbalanced classes. The ROC curve is created by plotting the true
positive rate (TPR) against the false positive rate (FPR), while the recall-precision curve is
obtained by plotting recall against precision. The FPR, TPR, recall and precision
are defined as:

$$\mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{TPR} = \frac{TP}{TP + FN}, \quad
\mathrm{recall} = \frac{TP}{TP + FN}, \quad \mathrm{precision} = \frac{TP}{TP + FP},$$

where $TP$ is the number of DNA-binding sites predicted correctly, $FP$ is the
number of non-binding sites wrongly predicted as DNA-binding sites, $TN$ is
the number of non-binding sites predicted correctly, and $FN$ is the number of
DNA-binding sites wrongly predicted as non-binding sites. For a better
predictor, the ROC curve should be closer to the top left corner of the figure, while the
recall-precision curve should be closer to the top right corner. Besides the two
curves, the area under the ROC curve (AUC) is also used as a single measurement
of the prediction. A better predictor will have a larger AUC value.
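
These measures are straightforward to compute from predicted scores. The following sketch uses made-up scores and labels, assumes that all scores are distinct, and computes the AUC with the trapezoidal rule; the function names are illustrative.

```python
def rates(tp, fp, tn, fn):
    """Return (FPR, TPR, recall, precision) from confusion-matrix counts."""
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    precision = tp / (tp + fp)
    return fpr, tpr, tpr, precision      # recall is the same quantity as TPR

def roc_points(scores, labels):
    """ROC points obtained by lowering the threshold one sample at a time.

    labels are 1 (binding site) or 0 (non-binding); scores are assumed distinct.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_neg, tp / num_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores for four candidate sites.
print(auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])))
```

Unlike accuracy, the AUC does not depend on a single decision threshold, which is convenient when binding sites are far fewer than non-binding sites.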

3.2.2. Results 

The ROC and recall-precision curves of the proposed method and the compared
methods are reported in Figure 2. The predictors using linear and kernel
representations are both illustrated. The AUC values of the ROC curves are
reported in Table 3 as well. Overall, the proposed method clearly outperforms
the other methods, although there is some variability in prediction
performance over the different representation types. From Table 3 we can
see that the AUC of the predictor is slightly increased by using the kernel
representation instead of the linear representation. The regularized correntropy
based predictor gives better results than the other methods on both
representations: with the linear representation, for example, its AUC is 0.9226
against 0.8908 for the best competing method, the hinge loss. An interesting
result from the DNA-binding prediction on this dataset is that the predictor
with the hinge loss function outperforms the other two compared methods.

Figure 2: ROC and recall-precision curves on the DNA-binding site prediction experiment using
both linear and kernel representations. (a) ROC of linear representation. (b) Recall-precision
curve of linear representation. (c) ROC of kernel representation. (d) Recall-precision curve
of kernel representation.

Table 3: AUC values of ROC curves on the DNA-binding site prediction experiment.

Method           AUC (linear representation)    AUC (kernel representation)
RegMaxCEM        0.9226                         0.9344
Square Loss      0.8768                         0.8891
Hinge Loss       0.8908                         0.8961
Logistic Loss    0.8747                         0.8776


4. Conclusion and Future Work 


In this paper, we present a novel regularized predictor learning model for
multi-class pattern recognition problems. The predictor is learned by
maximizing the correntropy between the prediction results and the true class labels. By
applying the MCC rule, we can treat different training samples differently, so
that noisy and outlying training samples have less impact on the learning of
predictors. Compared with the existing predictor models with various loss
functions, it is more robust to noisy and outlying training samples. The experiments
on bacteria identification and DNA-binding site prediction show that a good
predictor may benefit much from a well-designed loss function based on MCC.
The proposed method outperformed the predictors with other popularly used loss
functions. In the future, we will investigate whether the regularized maximum
correntropy framework can be used to regularize ranking score learning [23, 3[, data